65 research outputs found
Character-Aware Neural Language Models
We describe a simple neural language model that relies only on
character-level inputs. Predictions are still made at the word-level. Our model
employs a convolutional neural network (CNN) and a highway network over
characters, whose output is given to a long short-term memory (LSTM) recurrent
neural network language model (RNN-LM). On the English Penn Treebank the model
is on par with the existing state-of-the-art despite having 60% fewer
parameters. On languages with rich morphology (Arabic, Czech, French, German,
Spanish, Russian), the model outperforms word-level/morpheme-level LSTM
baselines, again with fewer parameters. The results suggest that on many
languages, character inputs are sufficient for language modeling. Analysis of
word representations obtained from the character composition part of the model
reveals that the model is able to encode, from characters only, both semantic
and orthographic information. Comment: AAAI 2016.
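The pipeline described above (character embeddings → convolution with max-over-time pooling → highway layer → word vector fed to the LSTM) can be sketched in miniature. This is a toy pure-Python illustration: the dimensions, random weights, and the word "cat" are all invented for the example, and a real model would be trained end-to-end in a deep-learning framework.

```python
import math, random

random.seed(0)

EMB, FILTERS, WIDTH = 3, 4, 2  # toy sizes, far smaller than the paper's

# Illustrative character embeddings (learned in the actual model).
chars = "abcdefghijklmnopqrstuvwxyz"
emb = {c: [random.uniform(-1, 1) for _ in range(EMB)] for c in chars}

# One bank of convolution filters of width 2 over the character sequence.
conv_w = [[[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(WIDTH)]
          for _ in range(FILTERS)]

def char_cnn(word):
    """Convolve each filter over the character embeddings, max-pool over time."""
    xs = [emb[c] for c in word]
    feats = []
    for f in conv_w:
        acts = []
        for t in range(len(xs) - WIDTH + 1):
            s = sum(f[k][d] * xs[t + k][d]
                    for k in range(WIDTH) for d in range(EMB))
            acts.append(math.tanh(s))
        feats.append(max(acts))  # max-over-time pooling: one feature per filter
    return feats

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Highway layer: y = t * h(W_H x) + (1 - t) * x, with gate t = sigmoid(W_T x).
W_H = [[random.uniform(-1, 1) for _ in range(FILTERS)] for _ in range(FILTERS)]
W_T = [[random.uniform(-1, 1) for _ in range(FILTERS)] for _ in range(FILTERS)]

def highway(x):
    h = [math.tanh(sum(W_H[i][j] * x[j] for j in range(FILTERS)))
         for i in range(FILTERS)]
    t = [sigmoid(sum(W_T[i][j] * x[j] for j in range(FILTERS)))
         for i in range(FILTERS)]
    return [t[i] * h[i] + (1 - t[i]) * x[i] for i in range(FILTERS)]

# The resulting vector would be fed to the LSTM at this word position.
word_vec = highway(char_cnn("cat"))
```

The highway gate lets the layer pass character-CNN features through unchanged where a plain nonlinearity would hurt, which is why the paper stacks it between the CNN and the LSTM.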
Tree block coordinate descent for MAP in graphical models
Abstract URL: http://jmlr.csail.mit.edu/proceedings/papers/v5/sontag09a.html
A number of linear programming relaxations have been proposed for finding most likely settings of the variables (MAP) in large probabilistic models. The relaxations are often succinctly expressed in the dual and reduce to different types of reparameterizations of the original model. The dual objectives are typically solved by performing local block coordinate descent steps. In this work, we show how to perform block coordinate descent on spanning trees of the graphical model. We also show how all of the earlier dual algorithms are related to each other, giving transformations from one type of reparameterization to another while maintaining monotonicity relative to a common objective function. Finally, we quantify when the MAP solution can and cannot be decoded directly from the dual LP relaxation.
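Block coordinate descent over a spanning tree relies on the fact that MAP can be solved exactly on a tree by max-sum dynamic programming. The sketch below shows that exact tree subproblem on a 3-variable binary chain with invented potentials, checked against exhaustive enumeration; it illustrates the subroutine, not the paper's full dual updates.

```python
import itertools

# Toy chain MRF over 3 binary variables; the potentials are illustrative.
theta_node = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.3]]   # theta_i(x_i)
theta_edge = [[[0.4, 0.0], [0.0, 0.4]],             # theta_01(x_0, x_1)
              [[0.0, 0.6], [0.6, 0.0]]]             # theta_12(x_1, x_2)

def map_chain(node, edge):
    """Max-sum dynamic programming: the exact MAP subproblem on a tree (chain)."""
    n = len(node)
    msg = [[0.0, 0.0] for _ in range(n)]  # msg[i][x] = best prefix score ending in x_i = x
    back = [[0, 0] for _ in range(n)]
    msg[0] = list(node[0])
    for i in range(1, n):
        for x in range(2):
            cands = [msg[i - 1][xp] + edge[i - 1][xp][x] for xp in range(2)]
            back[i][x] = cands.index(max(cands))
            msg[i][x] = max(cands) + node[i][x]
    x = [0] * n
    x[n - 1] = msg[n - 1].index(max(msg[n - 1]))
    for i in range(n - 1, 0, -1):  # backtrack the argmax
        x[i - 1] = back[i][x[i]]
    return x, max(msg[n - 1])

def score(x, node, edge):
    return (sum(node[i][x[i]] for i in range(len(x)))
            + sum(edge[i][x[i]][x[i + 1]] for i in range(len(x) - 1)))

assignment, value = map_chain(theta_node, theta_edge)
# Sanity check against brute force over all 2^3 assignments.
best = max(itertools.product(range(2), repeat=3),
           key=lambda x: score(list(x), theta_node, theta_edge))
```

On a tree this dynamic program is exact and linear-time, which is what makes spanning-tree blocks attractive compared with single-edge coordinate updates.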
Cutting plane algorithms for variational inference in graphical models
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 65-66).
In this thesis, we give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints are derived for the marginal polytope through a series of projections onto the cut polytope. Projecting onto a larger model gives an efficient separation algorithm for a large class of valid inequalities arising from each of the original projections. As a result, we obtain tighter upper bounds on the log-partition function than possible with previous variational inference algorithms. We also show empirically that our approximations of the marginals are significantly more accurate. This algorithm can also be applied to the problem of finding the Maximum a Posteriori assignment in an MRF, which corresponds to a linear program over the marginal polytope. One of the main contributions of the thesis is to bring together two seemingly different fields, polyhedral combinatorics and probabilistic inference, showing how certain results in either field can carry over to the other.
by David Alexander Sontag. S.M.
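One well-known family of valid inequalities obtained from the cut polytope is the cycle inequalities: a true joint distribution must disagree on an even number of edges around any cycle. A toy separation routine for a binary triangle, by brute-force enumeration of odd edge subsets (a real separation oracle would use a shortest-path construction), looks like this:

```python
from itertools import combinations

def violated_cycle_inequalities(d):
    """d[(i, j)] = pseudomarginal probability that x_i != x_j, for edges on one cycle.
    Any true distribution disagrees on an even number of cycle edges, so for every
    odd-sized subset F of the cycle C: sum_{F}(1 - d) + sum_{C\\F} d >= 1."""
    edges = list(d)
    out = []
    for r in range(1, len(edges) + 1, 2):  # odd-sized subsets only
        for F in combinations(edges, r):
            lhs = (sum(1 - d[e] for e in F)
                   + sum(d[e] for e in edges if e not in F))
            if lhs < 1 - 1e-9:
                out.append((F, lhs))
    return out

# Consistent pseudomarginals: no cycle inequality is violated.
ok = violated_cycle_inequalities({(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.5})
# Infeasible: all three pairs "disagree" with probability 1 on an odd cycle.
bad = violated_cycle_inequalities({(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0})
```

A cutting-plane loop would add each violated inequality to the relaxation and re-solve, tightening the outer bound on the marginal polytope.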
Scaling all-pairs overlay routing
This paper presents and experimentally evaluates a new algorithm for efficient one-hop link-state routing in full-mesh networks. Prior techniques for this setting scale poorly, as each node incurs quadratic (n^2) communication overhead to broadcast its link state to all other nodes. In contrast, in our algorithm each node exchanges routing state with only a small subset of overlay nodes determined by using a quorum system. Using a two-round protocol, each node can find an optimal one-hop path to any other node using only n^1.5 per-node communication. Our algorithm can also be used to find the optimal shortest path of arbitrary length using only n^1.5 log n per-node communication. The algorithm is designed to be resilient to both node and link failures.
We apply this algorithm to a Resilient Overlay Network (RON) system, and evaluate the results using a large-scale, globally distributed set of Internet hosts. The reduced communication overhead from using our improved full-mesh algorithm allows the creation of all-pairs routing overlays that scale to hundreds of nodes, without reducing the system's ability to rapidly find optimal routes.
National Science Foundation (U.S.). National Science Foundation (U.S.) Graduate Research Fellowship Program.
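The quorum idea can be sketched with a grid construction: each node's quorum is its row plus its column of a sqrt(n) x sqrt(n) grid, so any two quorums intersect, and a shared "rendezvous" node holding both endpoints' link states can compute the optimal relay. The 3x3 layout and random link costs below are invented for illustration; the paper's protocol handles failures and larger deployments.

```python
import random

random.seed(1)
SIDE = 3
N = SIDE * SIDE  # 9 overlay nodes arranged in a 3x3 quorum grid

def quorum(i):
    """Row plus column of node i: |Q(i)| ~ 2*sqrt(N), and any two quorums intersect."""
    r, c = divmod(i, SIDE)
    return {r * SIDE + k for k in range(SIDE)} | {k * SIDE + c for k in range(SIDE)}

# Symmetric illustrative link costs for the full mesh.
cost = {}
for i in range(N):
    cost[i, i] = 0
    for j in range(i + 1, N):
        cost[i, j] = cost[j, i] = random.randint(1, 20)

def best_one_hop(i, j):
    """A rendezvous node in Q(i) & Q(j) holds both nodes' link states,
    so it can pick the optimal relay k minimizing cost(i,k) + cost(k,j)."""
    rendezvous = (quorum(i) & quorum(j)).pop()  # non-empty by construction
    return min(range(N), key=lambda k: cost[i, k] + cost[k, j])

relay = best_one_hop(0, 8)
direct = cost[0, 8]
via = cost[0, relay] + cost[relay, 8]
```

Because k may equal either endpoint, the one-hop optimum is never worse than the direct link, matching the paper's claim that the overlay finds optimal one-hop routes with sub-quadratic state exchange.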
Learning efficiently with approximate inference via dual losses
Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane.
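The key structural idea, interleaving weight updates with per-example dual coordinate updates instead of solving each prediction subproblem to optimality, can be shown on a stand-in objective. The quadratic below is invented purely to illustrate the alternating pattern; it is not the paper's actual dual of the structured loss.

```python
# Stand-in jointly convex objective (illustrative only):
#   F(w, delta) = 0.5*w^2 + sum_i [0.5*(w - delta_i)^2 + 0.5*(delta_i - b_i)^2]
# Alternating exact minimization over w and over each dual coordinate delta_i
# mirrors interleaving weight updates with per-example dual coordinate descent:
# w improves before any delta_i subproblem is fully solved.
b = [1.0, 2.0, 3.0]
n = len(b)
w, delta = 0.0, [0.0] * n

for _ in range(200):
    w = sum(delta) / (1 + n)            # minimize F over w with the duals fixed
    delta = [(w + bi) / 2 for bi in b]  # minimize F over each dual coordinate

# Setting both partial derivatives to zero gives w* = sum(b) / (n + 2) = 1.2.
```

Each update is a closed-form block minimization, so F decreases monotonically and the iterates converge to the joint optimum, the same monotone-improvement property that makes the dual coordinate scheme attractive.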
Learning Bayesian network structure using LP relaxations
We propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data. This structure learning problem can be viewed as an inference problem where the variables specify the choice of parents for each node in the graph. The key combinatorial difficulty arises from the global constraint that the graph structure has to be acyclic. We cast the structure learning problem as a linear program over the polytope defined by valid acyclic structures. In relaxing this problem, we maintain an outer bound approximation to the polytope and iteratively tighten it by searching over a new class of valid constraints. If an integral solution is found, it is guaranteed to be the optimal Bayesian network. When the relaxation is not tight, the fast dual algorithms we develop remain useful in combination with a branch-and-bound method. Empirical results suggest that the method is competitive with or faster than alternative exact methods based on dynamic programming.
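To see why acyclicity is the binding global constraint, consider a tiny exhaustive version of the problem: pick a parent set per node to maximize a decomposable score, rejecting cyclic choices. The local scores and the restriction to parent sets of size at most one are invented for the example (the paper instead optimizes an LP relaxation over the acyclic-structure polytope, which scales far beyond enumeration).

```python
from itertools import product

nodes = ["A", "B", "C"]
# Hypothetical decomposable local scores: score[(node, parents)].
score = {("A", ()): 0, ("B", ()): 0, ("C", ()): 0,
         ("B", ("A",)): 3, ("C", ("B",)): 3, ("A", ("C",)): 2}

def acyclic(structure):
    """Kahn-style check: repeatedly peel off nodes with no remaining parent."""
    edges = {(p, child) for child, parents in structure.items() for p in parents}
    remaining = set(structure)
    while remaining:
        free = [v for v in remaining
                if not any((p, v) in edges for p in remaining)]
        if not free:
            return False  # every remaining node has a parent: a cycle exists
        remaining -= set(free)
    return True

# Per-node greedy picks would choose the cycle A->B->C->A; the acyclicity
# constraint couples the choices, so we search jointly over parent sets.
choices = {v: [ps for (u, ps) in score if u == v] for v in nodes}
best = max(
    (dict(zip(nodes, combo))
     for combo in product(*[choices[v] for v in nodes])
     if acyclic(dict(zip(nodes, combo)))),
    key=lambda s: sum(score[(v, s[v])] for v in nodes))
```

Here the unconstrained per-node optimum forms the 3-cycle (total 8), but the best acyclic structure must drop an edge, and the search correctly selects the chain A -> B -> C.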
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Data cleaning is naturally framed as probabilistic inference in a generative
model, combining a prior distribution over ground-truth databases with a
likelihood that models the noisy channel by which the data are filtered,
corrupted, and joined to yield incomplete, dirty, and denormalized datasets.
Based on this view, we present PClean, a unified generative modeling
architecture for cleaning and normalizing dirty data in diverse domains. Given
an unclean dataset and a probabilistic program encoding relevant domain
knowledge, PClean learns a structured representation of the data as a
relational database of interrelated objects, and uses this latent structure to
impute missing values, identify duplicates, detect errors, and propose
corrections in the original data table. PClean makes three modeling and
inference contributions: (i) a domain-general non-parametric generative model
of relational data, for inferring latent objects and their network of latent
connections; (ii) a domain-specific probabilistic programming language, for
encoding domain knowledge specific to each dataset being cleaned; and (iii) a
domain-general inference engine that adapts to each PClean program by
constructing data-driven proposals used in sequential Monte Carlo and particle
Gibbs. We show empirically that short (< 50-line) PClean programs deliver
higher accuracy than state-of-the-art data cleaning systems based on machine
learning and weighted logic; that PClean's inference algorithm is faster than
generic particle Gibbs inference for probabilistic programs; and that PClean
scales to large real-world datasets with millions of rows. Comment: Added references; revised abstract.
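The "prior over ground truth times noisy-channel likelihood" framing can be shown in miniature. Everything below is a hypothetical stand-in, not PClean's actual model or inference engine: a handful of candidate true values with made-up prior weights, a typo channel whose likelihood decays with edit distance, and a posterior argmax as the proposed correction.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[-1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical prior over ground-truth values (e.g. city-name frequencies).
prior = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.2}
TYPO = 0.1  # assumed per-edit corruption probability in the noisy channel

def likelihood(observed, truth):
    return TYPO ** edit_distance(observed, truth)

def clean(observed):
    """Posterior over true values is proportional to prior times likelihood."""
    post = {t: prior[t] * likelihood(observed, t) for t in prior}
    return max(post, key=post.get)

fixed = clean("Bostn")
```

Even this toy shows the appeal of the generative view: the same posterior that proposes corrections could also impute missing values or flag likely errors, simply by querying different latent quantities.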
Overcomplete Independent Component Analysis via SDP
We present a novel algorithm for overcomplete independent component analysis
(ICA), where the number of latent sources k exceeds the dimension p of observed
variables. Previous algorithms either suffer from high computational complexity
or make strong assumptions about the form of the mixing matrix. Our algorithm
does not make any sparsity assumption yet enjoys favorable computational and
theoretical properties. Our algorithm consists of two main steps: (a)
estimation of the Hessians of the cumulant generating function (as opposed to
the fourth and higher order cumulants used by most algorithms) and (b) a novel
semi-definite programming (SDP) relaxation for recovering a mixing component.
We show that this relaxation can be efficiently solved with a projected
accelerated gradient descent method, which makes the whole algorithm
computationally practical. Moreover, we conjecture that the proposed program
recovers a mixing component at the rate k < p^2/4 and prove that a mixing
component can be recovered with high probability when k < (2 - epsilon) p log p
when the original components are sampled uniformly at random on the hypersphere. Experiments are provided on synthetic data and the CIFAR-10 dataset of real images. Comment: Appears in: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019). 21 pages.
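The projected gradient step for an SDP relaxation requires projecting an iterate onto the positive semidefinite cone, which amounts to zeroing out negative eigenvalues. A minimal closed-form version for 2x2 symmetric matrices is below; the general case uses a full eigendecomposition, and this sketch is only the projection subroutine, not the paper's algorithm.

```python
import math

def project_psd_2x2(a, b, c):
    """Project the symmetric matrix [[a, b], [b, c]] onto the PSD cone
    by clamping its negative eigenvalue to zero."""
    mean, half = (a + c) / 2.0, (a - c) / 2.0
    radius = math.hypot(half, b)
    lo, hi = mean - radius, mean + radius  # closed-form 2x2 eigenvalues
    if lo >= 0:
        return [[a, b], [b, c]]            # already PSD: projection is identity
    # Unit eigenvector for the larger eigenvalue hi (direction (hi - c, b)).
    if abs(b) > 1e-12:
        vx, vy = hi - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    hi = max(hi, 0.0)
    # Reconstruct keeping only the nonnegative eigenvalue: hi * v v^T.
    return [[hi * vx * vx, hi * vx * vy],
            [hi * vx * vy, hi * vy * vy]]

P = project_psd_2x2(1.0, 2.0, 1.0)  # eigenvalues 3 and -1 -> keep only 3
```

Because the projection is just an eigenvalue clamp, each projected (accelerated) gradient iteration stays cheap, which is the computational point the abstract emphasizes.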